Taxanorm: a novel taxa-specific normalization approach for microbiome data

Background In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample specific but not taxa specific, which can result in over- or under-correction for some taxa. Results We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. This method assumes the effects of sequencing depth on mean and dispersion vary across taxa. Incorporating the zero-inflation part can better capture the nature of microbiome data. We also propose two corresponding diagnosis tests on the varying sequencing depth effect for validation. We find that TaxaNorm achieves comparable performance to existing methods in most simulation scenarios in downstream analysis and reaches a higher power for some cases. Specifically, it balances power and false discovery control well. When applying the method in a real dataset, TaxaNorm has improved performance when correcting technical bias. Conclusion TaxaNorm both sample- and taxon- specific bias by introducing an appropriate regression framework in the microbiome data, which aids in data interpretation and visualization. The ‘TaxaNorm’ R package is freely available through the CRAN repository https://CRAN.R-project.org/package=TaxaNorm and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm.

1 Details of TaxaNorm framework

ZINB distribution
Let f nb (y; µ, θ) denote the PMF of the NB distribution.The PMF of ZINB distribution is given by

Summary of scaling methods for comparison
The observed taxa count is divided by a size factor so that sequencing depth does not confound the reconstructed taxa count.A constant size factor is generally used for all taxa within a sample but varies by sample.Let Y ij be the observed count for taxon i(i = 1, ..., p) in sample j(j = 1, ..., n) and s j denote the sample-specific size factors.The normalized taxa count is calculated as r is the reference sample G * represents a set of taxa that were not considered as extreme data CSS       5 Additional real data results

Figure 1 :
Figure 1: Overview of the TaxaNorm workflow.The input data are the raw taxa count matrix and sequencing depth vector.A varying dispersion ZINB model was used to estimate parameters for each taxon.Two diagnosis tests (prevalence and equivalence) were applied to assess taxa characteristics.Quantile residuals were used as the normalized count.

Figure 2 :
Figure 2: Comparison of different methods in terms of (a) power and (b) FDR in identifying DA taxa using simulation datasets.The biological difference between groups is set to large (fold-change from U (4, 10)).Each panel shows a simulation scenario with various percentages of DA taxa on the x-axis.The corresponding methods are denoted by the colors shown at the side.The BH procedure is used to adjust for multiple testing with 5% as the nominal FDR level (dashed line).

Figure 3 :
Figure 3: Comparison of different methods in terms of (a) power and (b) FDR in identifying DA taxa using simulation datasets.The biological difference between groups is set to large (fold-change from U (4, 10)).Each panel shows a simulation scenario with various percentages of DA taxa on the x-axis.The corresponding methods are denoted by the colors shown at the side.The BH procedure is used to adjust for multiple testing with 5% as the nominal FDR level (dashed line).

Figure 4 :
Figure 4: Comparison of different methods in terms of (a) power and (b) FDR in identifying DA taxa using simulation datasets.The biological difference between groups is set to large (fold-change from U (4, 10)).Each panel shows a simulation scenario with various percentages of DA taxa on the x-axis.The corresponding methods are denoted by the colors shown at the side.The BH procedure is used to adjust for multiple testing with 5% as the nominal FDR level (dashed line).

Figure 5 :
Figure 5: Comparison of different methods in terms of (a) power and (b) FDR in identifying DA taxa using simulation datasets.Sequencing effect effect are equal across all taxa.The sample size for each group is 500 with 50% DA taxa.The corresponding methods are denoted by the colors shown at the side.The BH procedure is used to adjust for multiple testing with 5% as the nominal FDR level (dashed line).

4Figure 6 :
Figure6: QQ plot for for real vs.estimated parameters in a simulation dataset with 90% of taxa without the sequencing depth effect and a sample size of 1000.

Figure 7 :
Figure 7: Sequencing efficiency of each taxon in HMP 16S gut microbiota data.The relationship between raw taxa-specific counts and sequencing depth was estimated using ZINB regression.Densities of corresponding coefficients are colored by phylum rank for all taxa.

Figure 8 :
Figure 8: Sequencing efficiency of each taxon of fecal microbiota data from a diarrhea cohort.The relationship between raw taxa-specific counts and sequencing depth was estimated using ZINB regression.Densities of corresponding coefficients are colored by phylum rank for all taxa.

Figure 9 :
Figure 9: NMDS visualizations for raw HMP data.Two NMDS coordinates were used to present the data with BSS=1026.

Table 1 :
Simulation scenarios for model diagnosis tests.

Table 2 :
Table 3 lists the scaling normalization methods used for comparison in this paper.Simulation scenarios for DA analysis.

Table 3 :
Summary of scaling normalization methods3 Additional simulation results